| Citation | Method | Output | Conclusions |
|---|---|---|---|
| TroubleMiner: Mining network trouble tickets Medem, A. ; Akodjenou, M.-I ; Teixeira, R. 20091 | Trouble tickets classification | * Automation process on clustering the free text inside the description of the trouble tickets. * Choosing the correct keywords for the analysis | Using term frequency distance between trouble tickets and similarity between clusters |
| Knowledge Discovery from Trouble Ticketing Reports in a Large Telecommunication Company Temprado, Y. ; Garcia, C. ; Molinero, F.J. 2009 | Data Mining , Text Mining and Machine Learning , Bayes Net, Naïve Bayes | Prediction on the next action of trouble tickets ,Different snapshots were added to the machine learning algorithm for training | Combination of multiple method to construct the recommendation , Using Bayesian for prediction |
| A Bayesian Approach To Stochastic Root Finding 2011 | x | x | x |
| A Fully Bayesian Approach For Unit Root Testing 2011 | x | x | x |
| Online Root-Cause Analysis Of Alarms In Discrete Bayesian 2014 | x | x | x |
| Documents Categorization Based On Bayesian Spanning Tree 2006 | x | x | x |
| Benefits of a Bayesian Approach to Anomaly and Failure 2009 | x | x | x |
List of literature review regarding Bayesian Net :-
1.A real-life application of multi-agent systems for fault diagnosis in the provision of an Internet business service
2.A Bayesian Network approach to diagnosing the root cause of failure
3.sss
| Rules | Description |
|---|---|
| status = ‘Closed’ | Dataset must be closed for complete information |
| network_tt_id is NULL | Dataset must be not related to Network Trouble Ticket |
| trouble ticket type <> PASSIVE | Trouble Ticket must related to the Active elements such as routers, switches , modem , etc |
| installed_date is NOT NULL | This field must have value |
| created_date is NOT NULL | This field must have value |
| closed_date is NOT NULL | This field must have value |
| closed_date is NOT NULL | This field must have value |
| product is NOT NULL | This field must have value |
| sub_product is NOT NULL | This field must have value |
| length description > 10 | This field is useful for text analysis |
| rand() | Record selection is in random mode |
| zone | Should selective from different zone , sparse |
For sample purpose - selecting dataset from ZONE KEPONG for the analysis due to this zone has the highest records inside the Trouble Ticket dataset.
Documentation - https://github.com/piersharding/dplyrimpaladb
install.packages(c("RJDBC", "devtools", "dplyr"))
devtools::install_github("jwills/dplyrimpaladb")
install.packages("dplyrimpaladb")
Basic Impala drivers can be downloaded from https://github.com/Mu-Sigma/RImpala/blob/master/impala-jdbc-cdh5.zip
Below is the components required and how to set the class path for the Impala drivers , RJava , RJDBC and dplyr
suppressWarnings(suppressMessages(library("rJava")))
suppressWarnings(suppressMessages(library("RJDBC")))
suppressWarnings(suppressMessages(library("dplyr")))
suppressWarnings(suppressMessages(library("caret")))
suppressWarnings(suppressMessages(library("corrplot")))
suppressWarnings(suppressMessages(library("lazy")))
suppressWarnings(suppressMessages(library("dplyrimpaladb")))
suppressWarnings(suppressMessages(library("rpart")))
suppressWarnings(suppressMessages(library("DiagrammeR")))
.jaddClassPath(c(list.files(paste(getwd(),"/lib",sep = ''),pattern="jar$",full.names=T)))
.jinit(classpath = c(list.files(paste(getwd(),"/lib",sep = ''),pattern="jar$",full.names=T)))
dplyr.jdbc.classpath = c(list.files(paste(getwd(),"/lib",sep = ''),pattern="jar$",full.names=T))
conn <- src_impaladb(dbname='nova', host='10.54.1.151')
## Loading required package: testthat
## [1] "here:"
## [1] FALSE
result <- tbl(conn, sql("select zone from nova.nova_trouble_ticket where zone <> 'null' group by zone order by zone limit 1000"))
as.data.frame(result)
## zone
## 1 ZONE AIR ITAM
## 2 ZONE BANGI
## 3 ZONE BANGSAR
## 4 ZONE BANTING
## 5 ZONE BATU
## 6 ZONE BATU PAHAT
## 7 ZONE BAYAN BARU
## 8 ZONE BINTULU
## 9 ZONE BUKIT ANGGERIK
## 10 ZONE BUKIT MERTAJAM
## 11 ZONE BUKIT RAJA
## 12 ZONE BUTTERWORTH
## 13 ZONE CYBERJAYA
## 14 ZONE GOMBAK
## 15 ZONE IPOH
## 16 ZONE KAJANG
## 17 ZONE KEPONG
## 18 ZONE KERAMAT
## 19 ZONE KINRARA
## 20 ZONE KL CENTRAL
## 21 ZONE KLANG
## 22 ZONE KOTA KINABALU SELATAN
## 23 ZONE KOTA KINABALU UTARA
## 24 ZONE KUCHING
## 25 ZONE KULIM
## 26 ZONE LANGKAWI
## 27 ZONE MALURI
## 28 ZONE MELAKA UTARA
## 29 ZONE MIRI
## 30 ZONE N. SEMBILAN UTARA
## 31 ZONE PANDAN
## 32 ZONE PELANGI
## 33 ZONE PERLIS
## 34 ZONE PETALING JAYA
## 35 ZONE PUCHONG
## 36 ZONE SEBERANG JAYA
## 37 ZONE SENAI
## 38 ZONE SG PETANI
## 39 ZONE SHAH ALAM
## 40 ZONE SIBU
## 41 ZONE SKUDAI PONTIAN
## 42 ZONE STAMPIN
## 43 ZONE SUBANG JAYA
## 44 ZONE TAMAN PETALING
## 45 ZONE TAMPOI
## 46 ZONE TAR
## 47 ZONE TASEK
## 48 ZONE TASIK AMPANG
## 49 ZONE TDI
## 50 ZONE TELUK INTAN
## 51 ZONE TERENGGANU SELATAN
## 52 ZONE TERUNTUM
result <- tbl(conn, sql("select * from nova_trouble_ticket where zone <> 'null' limit 1"))
as.data.frame(apply(as.data.frame(result),2,class))
## apply(as.data.frame(result), 2, class)
## tt_row_id character
## tt_num character
## tt_type character
## tt_sub_type character
## status character
## severity character
## important_message character
## appointment_flag character
## nova_account_name character
## nova_subscriber_num character
## nova_account_num character
## package_row_id character
## created_by character
## category character
## symptom_error_code character
## priority character
## product character
## sub_product character
## package_name character
## network_tt_id character
## swap_order_num character
## cause_category character
## cause_code character
## resolution_code character
## closure_category character
## resolution_team character
## service_affected character
## service_order_num character
## btu_type character
## owner character
## owner_name character
## group_owner character
## owner_position character
## btu_platform character
## dp_location character
## created_date character
## pending_verify_date character
## closed_by character
## closed_date character
## source character
## installed_date character
## description character
## repeat_ticket_count character
## follow_up_ticket_count character
## fdp_device_name character
## fdp_site_name character
## olt_site_name character
## exchange character
## timestamp character
## contact_id character
## contact_name character
## contact_office_phone character
## contact_mobile_phone character
## contact_home_phone character
## contact_email_addr character
## due_date character
## part_num character
## network_layer character
## network_row_id character
## asset_id character
## ptt character
## zone character
## service_point_id character
Sample dataset - Selection trouble tickets only from Zone Kepong. The SQL is define by :-
Zone Kepong contains very rich information especially for the textual analysis and also one of the largest composition of the cause code & the resolution code which is good for the supervised learning.
| Rules | Description |
|---|---|
| a.status like ‘%Closed%’ | Dataset must be closed for complete information |
| network_tt_id = ‘null’ | Dataset must be not related to Network Trouble Ticket |
| trouble ticket type <> PASSIVE | Trouble Ticket must related to the Active elements such as routers, switches , modem , etc. Excluding for now if related to the 3rd party causes , customer behavior and Passive elements |
| installed_date is NOT NULL | This field must have value |
| created_date is NOT NULL | This field must have value |
| closed_date is NOT NULL | This field must have value |
| closed_date is NOT NULL | This field must have value |
| product is NOT NULL | This field must have value |
| sub_product is NOT NULL | This field must have value |
| length description > 10 | This field is useful for text analysis |
| rand() | Record selection is in random mode |
| zone | Should selective from different zone , sparse control |
Generated SQL :-
select * from nova_trouble_ticket a join active_code b on (trim(a.cause_code) = trim(b.cause_code)) join exchange_zone c ON (trim(a.exchange)=trim(c.building_id)) and (b.code <> 'PASSIVE' ) where c.zone_name like '%ZONE KEPONG%' and a.status like '%Closed%' and length(a.cause_category) > 1 and length(a.created_date) > 6 and length(a.closed_date) > 6 and length(a.installed_date) > 6 and a.package_name not like '%null%' and a.product not like '%null%' and a.sub_product not like '%null%' and length(a.description) > 10 and network_tt_id = 'null' order by rand() limit 10000 "
Removing non-related fields such as trouble ticket key , trouble ticket number , trouble ticket date etc.
conn <- src_impaladb(dbname='nova', host='10.54.1.151')
## [1] "here:"
## [1] FALSE
result <- tbl(conn, sql("select a.tt_row_id,a.tt_num,a.tt_type,a.tt_sub_type,a.status,a.severity,a.important_message,a.appointment_flag,a.nova_account_name,a.nova_subscriber_num,a.nova_account_num,a.package_row_id,a.created_by,a.category,a.symptom_error_code,a.priority,a.product,a.sub_product,a.package_name,a.network_tt_id,a.swap_order_num,a.cause_category,a.cause_code,a.resolution_code,a.closure_category,a.resolution_team,a.service_affected,a.service_order_num,a.btu_type,a.owner,a.owner_name,a.group_owner,a.owner_position,a.btu_platform,a.dp_location,a.created_date,a.pending_verify_date,a.closed_by,a.closed_date,a.source,a.installed_date,a.description,a.repeat_ticket_count,a.follow_up_ticket_count,a.fdp_device_name,
a.fdp_site_name,a.olt_site_name,a.exchange,a.`timestamp`,a.contact_id,a.contact_name,a.contact_office_phone,a.contact_mobile_phone,a.contact_home_phone,a.contact_email_addr,a.due_date,a.part_num,a.network_layer,a.network_row_id,a.asset_id,a.ptt,a.zone,a.service_point_id , c.zone_name, c.district,c.state, c.region from nova_trouble_ticket a join active_code b on (trim(a.cause_code) = trim(b.cause_code)) join exchange_zone c ON (trim(a.exchange)=trim(c.building_id)) and (b.code <> 'PASSIVE' ) where c.zone_name like '%ZONE KEPONG%' and a.status like '%Closed%' and length(a.cause_category) > 1 and length(a.created_date) > 6 and length(a.closed_date) > 6 and length(a.installed_date) > 6 and a.package_name not like '%null%' and a.product not like '%null%' and a.sub_product not like '%null%' and length(a.description) > 10 and a.network_tt_id = 'null' order by rand() limit 100"))
result <- as.data.frame(result)
Close the connection from Impala
x <- conn$con
class(x) <- c('JDBCConnection')
dbDisconnect(x)
## [1] TRUE
Save the class as the data.frame
df <- as.data.frame(result)
df$contact_name <- NULL
df$contact_home_phone <- NULL
df$contact_email_addr <- NULL
df$contact_office_phone <- NULL
df$contact_mobile_phone <- NULL
df$`tt_row_id` <- NULL
df$`tt_num` <- NULL
df$tt_type <- NULL
df$`created_date` <- NULL
df$`closed_date` <- NULL
df$`installed_date` <- NULL
df$timestamp <- NULL
df$service_point_id <- NULL
df$contact_id <- NULL
df$owner_position <- NULL
df$tt_sub_type <- NULL
df$severity <- NULL
df$status <- NULL
df$important_message <- NULL
df$network_tt_id <- NULL
df$swap_order_num <- NULL
df$appointment_flag <- NULL
df$nova_account_name <- NULL
df$nova_subscriber_num <- NULL
df$nova_account_num <- NULL
df$repeat_ticket_count <- NULL
df$follow_up_ticket_count <- NULL
df$service_order_num <- NULL
df$source <- NULL
df$owner_name <- NULL
df$description <- NULL
df$due_date <- NULL
df$part_num <- NULL
df$zone <- NULL
df$ptt <- NULL
df$asset_id <- NULL
df$network_layer <- NULL
df$network_row_id <- NULL
df$pending_verify_date <- NULL
df$package_row_id <- NULL
df$priority <- NULL
summary(df)
## created_by category symptom_error_code
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## product sub_product package_name
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## cause_category cause_code resolution_code
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## closure_category resolution_team service_affected
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## btu_type owner group_owner
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## btu_platform dp_location closed_by
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## fdp_device_name fdp_site_name olt_site_name
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## exchange zone_name district
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## state region
## Length:100 Length:100
## Class :character Class :character
## Mode :character Mode :character
Looping the columns name and rename it to [column name]+1 as the factor name
for(i in names(df)){
num <- as.numeric(as.factor(df[,i]))-1
df <- cbind(df,num)
names(df)[names(df)=="num"] <- paste(names(df[i]),"_factor",sep = "")
print(paste(names(df[i]),"1",sep = ""))
}
## [1] "created_by1"
## [1] "category1"
## [1] "symptom_error_code1"
## [1] "product1"
## [1] "sub_product1"
## [1] "package_name1"
## [1] "cause_category1"
## [1] "cause_code1"
## [1] "resolution_code1"
## [1] "closure_category1"
## [1] "resolution_team1"
## [1] "service_affected1"
## [1] "btu_type1"
## [1] "owner1"
## [1] "group_owner1"
## [1] "btu_platform1"
## [1] "dp_location1"
## [1] "closed_by1"
## [1] "fdp_device_name1"
## [1] "fdp_site_name1"
## [1] "olt_site_name1"
## [1] "exchange1"
## [1] "zone_name1"
## [1] "district1"
## [1] "state1"
## [1] "region1"
df <- df[27:52]
names(df)
## [1] "created_by_factor" "category_factor"
## [3] "symptom_error_code_factor" "product_factor"
## [5] "sub_product_factor" "package_name_factor"
## [7] "cause_category_factor" "cause_code_factor"
## [9] "resolution_code_factor" "closure_category_factor"
## [11] "resolution_team_factor" "service_affected_factor"
## [13] "btu_type_factor" "owner_factor"
## [15] "group_owner_factor" "btu_platform_factor"
## [17] "dp_location_factor" "closed_by_factor"
## [19] "fdp_device_name_factor" "fdp_site_name_factor"
## [21] "olt_site_name_factor" "exchange_factor"
## [23] "zone_name_factor" "district_factor"
## [25] "state_factor" "region_factor"
Remove the predictors column which might have one unique value which can leads to zero variance result
The list below is non-zero variance variables
df <- df[,-nearZeroVar(df)]
names(df)
## [1] "created_by_factor" "category_factor"
## [3] "symptom_error_code_factor" "product_factor"
## [5] "sub_product_factor" "package_name_factor"
## [7] "cause_category_factor" "cause_code_factor"
## [9] "resolution_code_factor" "closure_category_factor"
## [11] "resolution_team_factor" "service_affected_factor"
## [13] "btu_type_factor" "owner_factor"
## [15] "group_owner_factor" "btu_platform_factor"
## [17] "dp_location_factor" "closed_by_factor"
## [19] "fdp_device_name_factor" "fdp_site_name_factor"
## [21] "olt_site_name_factor" "exchange_factor"
Find the correlation between the variables using Pearson.
correlations <- cor(df, use="pairwise.complete.obs", method="pearson")
print(correlations)
## created_by_factor category_factor
## created_by_factor 1.000000e+00 -0.103518789
## category_factor -1.035188e-01 1.000000000
## symptom_error_code_factor -6.157588e-02 0.093625736
## product_factor 5.092945e-02 -0.004756966
## sub_product_factor 1.647699e-01 -0.070217371
## package_name_factor 6.622391e-02 -0.033616222
## cause_category_factor 3.574845e-02 0.018413519
## cause_code_factor 1.460979e-02 0.049236852
## resolution_code_factor -3.524580e-02 -0.056954507
## closure_category_factor 3.942088e-02 0.093577125
## resolution_team_factor 4.978484e-02 0.196889896
## service_affected_factor -7.054111e-02 -0.332673920
## btu_type_factor -6.849342e-02 -0.044298391
## owner_factor 4.225674e-01 -0.001304302
## group_owner_factor 3.802286e-02 0.175218963
## btu_platform_factor -1.985990e-02 -0.121242359
## dp_location_factor -8.174213e-05 0.090481946
## closed_by_factor 4.225674e-01 -0.001304302
## fdp_device_name_factor -1.955220e-02 0.030928924
## fdp_site_name_factor -1.955220e-02 0.030928924
## olt_site_name_factor -2.774736e-03 -0.002208588
## exchange_factor 4.386255e-02 0.050170300
## symptom_error_code_factor product_factor
## created_by_factor -6.157588e-02 0.050929453
## category_factor 9.362574e-02 -0.004756966
## symptom_error_code_factor 1.000000e+00 -0.092192502
## product_factor -9.219250e-02 1.000000000
## sub_product_factor -4.707342e-02 0.622173565
## package_name_factor 8.139102e-21 0.827404294
## cause_category_factor 1.652147e-01 0.169942179
## cause_code_factor 1.332267e-01 0.019968619
## resolution_code_factor 8.237376e-02 0.265439233
## closure_category_factor -1.033376e-02 -0.057229432
## resolution_team_factor 1.051736e-01 -0.326132300
## service_affected_factor -2.777790e-01 0.246428797
## btu_type_factor -4.769590e-02 0.217192576
## owner_factor 4.493879e-02 -0.010655232
## group_owner_factor 6.987317e-03 -0.250018447
## btu_platform_factor 0.000000e+00 0.159181823
## dp_location_factor 9.742155e-02 -0.036298712
## closed_by_factor 4.493879e-02 -0.010655232
## fdp_device_name_factor -6.792751e-02 0.033258575
## fdp_site_name_factor -6.792751e-02 0.033258575
## olt_site_name_factor -8.481453e-02 0.035622065
## exchange_factor -6.752270e-02 0.049916894
## sub_product_factor package_name_factor
## created_by_factor 0.16476987 6.622391e-02
## category_factor -0.07021737 -3.361622e-02
## symptom_error_code_factor -0.04707342 8.139102e-21
## product_factor 0.62217357 8.274043e-01
## sub_product_factor 1.00000000 6.600184e-01
## package_name_factor 0.66001838 1.000000e+00
## cause_category_factor 0.03649430 2.689180e-02
## cause_code_factor 0.13333300 1.594497e-03
## resolution_code_factor 0.13079713 1.480758e-01
## closure_category_factor -0.12824227 -9.866713e-02
## resolution_team_factor -0.20900820 -2.148615e-01
## service_affected_factor 0.14512655 1.654494e-01
## btu_type_factor 0.13717818 8.594294e-02
## owner_factor 0.01375153 3.725046e-02
## group_owner_factor -0.11187639 -1.622350e-01
## btu_platform_factor 0.13750598 7.485758e-02
## dp_location_factor 0.05008800 4.885970e-02
## closed_by_factor 0.01375153 3.725046e-02
## fdp_device_name_factor -0.03721062 -5.055878e-02
## fdp_site_name_factor -0.03721062 -5.055878e-02
## olt_site_name_factor -0.03907852 -6.910593e-02
## exchange_factor -0.05742457 1.436679e-03
## cause_category_factor cause_code_factor
## created_by_factor 0.035748449 0.014609787
## category_factor 0.018413519 0.049236852
## symptom_error_code_factor 0.165214679 0.133226689
## product_factor 0.169942179 0.019968619
## sub_product_factor 0.036494296 0.133332998
## package_name_factor 0.026891804 0.001594497
## cause_category_factor 1.000000000 0.289349037
## cause_code_factor 0.289349037 1.000000000
## resolution_code_factor 0.424575076 0.131475713
## closure_category_factor 0.145313431 -0.104022198
## resolution_team_factor -0.327183873 0.109750791
## service_affected_factor 0.037499062 0.035906434
## btu_type_factor -0.005276490 -0.119373133
## owner_factor -0.177254643 -0.107275894
## group_owner_factor -0.047123829 -0.027549588
## btu_platform_factor -0.005635697 -0.138019471
## dp_location_factor 0.037863902 0.194969083
## closed_by_factor -0.177254643 -0.107275894
## fdp_device_name_factor -0.110953694 -0.378169436
## fdp_site_name_factor -0.110953694 -0.378169436
## olt_site_name_factor -0.082350746 -0.407144023
## exchange_factor -0.107455667 -0.411389901
## resolution_code_factor closure_category_factor
## created_by_factor -0.03524580 0.03942088
## category_factor -0.05695451 0.09357713
## symptom_error_code_factor 0.08237376 -0.01033376
## product_factor 0.26543923 -0.05722943
## sub_product_factor 0.13079713 -0.12824227
## package_name_factor 0.14807580 -0.09866713
## cause_category_factor 0.42457508 0.14531343
## cause_code_factor 0.13147571 -0.10402220
## resolution_code_factor 1.00000000 0.08181496
## closure_category_factor 0.08181496 1.00000000
## resolution_team_factor -0.46570023 -0.16239538
## service_affected_factor -0.01380502 0.17534826
## btu_type_factor 0.32825269 0.01711272
## owner_factor -0.21393749 -0.14933430
## group_owner_factor -0.24367602 0.17405501
## btu_platform_factor 0.32169090 0.03459707
## dp_location_factor -0.24084516 -0.07656520
## closed_by_factor -0.21393749 -0.14933430
## fdp_device_name_factor 0.22148871 0.01076765
## fdp_site_name_factor 0.22148871 0.01076765
## olt_site_name_factor 0.25584205 0.02624574
## exchange_factor 0.14321242 -0.01938107
## resolution_team_factor service_affected_factor
## created_by_factor 0.04978484 -0.070541108
## category_factor 0.19688990 -0.332673920
## symptom_error_code_factor 0.10517361 -0.277778965
## product_factor -0.32613230 0.246428797
## sub_product_factor -0.20900820 0.145126554
## package_name_factor -0.21486148 0.165449403
## cause_category_factor -0.32718387 0.037499062
## cause_code_factor 0.10975079 0.035906434
## resolution_code_factor -0.46570023 -0.013805023
## closure_category_factor -0.16239538 0.175348258
## resolution_team_factor 1.00000000 -0.174459776
## service_affected_factor -0.17445978 1.000000000
## btu_type_factor -0.23015013 0.025075320
## owner_factor 0.30597762 -0.191518467
## group_owner_factor 0.08907496 -0.108683885
## btu_platform_factor -0.18976157 0.072958874
## dp_location_factor 0.15548815 -0.120601390
## closed_by_factor 0.30597762 -0.191518467
## fdp_device_name_factor -0.13273053 0.004493253
## fdp_site_name_factor -0.13273053 0.004493253
## olt_site_name_factor -0.13371539 -0.001264552
## exchange_factor -0.07617183 0.045046915
## btu_type_factor owner_factor group_owner_factor
## created_by_factor -0.06849342 0.422567422 0.038022862
## category_factor -0.04429839 -0.001304302 0.175218963
## symptom_error_code_factor -0.04769590 0.044938790 0.006987317
## product_factor 0.21719258 -0.010655232 -0.250018447
## sub_product_factor 0.13717818 0.013751528 -0.111876388
## package_name_factor 0.08594294 0.037250458 -0.162235001
## cause_category_factor -0.00527649 -0.177254643 -0.047123829
## cause_code_factor -0.11937313 -0.107275894 -0.027549588
## resolution_code_factor 0.32825269 -0.213937489 -0.243676015
## closure_category_factor 0.01711272 -0.149334300 0.174055014
## resolution_team_factor -0.23015013 0.305977624 0.089074964
## service_affected_factor 0.02507532 -0.191518467 -0.108683885
## btu_type_factor 1.00000000 -0.114728843 -0.210344389
## owner_factor -0.11472884 1.000000000 0.312657631
## group_owner_factor -0.21034439 0.312657631 1.000000000
## btu_platform_factor 0.91743142 -0.140281402 -0.235698575
## dp_location_factor -0.76330054 0.089358585 0.179428745
## closed_by_factor -0.11472884 1.000000000 0.312657631
## fdp_device_name_factor 0.69773957 -0.102563516 -0.123760434
## fdp_site_name_factor 0.69773957 -0.102563516 -0.123760434
## olt_site_name_factor 0.69504545 -0.093581969 -0.151449059
## exchange_factor 0.40493981 -0.073246090 -0.083543009
## btu_platform_factor dp_location_factor
## created_by_factor -0.019859901 -8.174213e-05
## category_factor -0.121242359 9.048195e-02
## symptom_error_code_factor 0.000000000 9.742155e-02
## product_factor 0.159181823 -3.629871e-02
## sub_product_factor 0.137505980 5.008800e-02
## package_name_factor 0.074857575 4.885970e-02
## cause_category_factor -0.005635697 3.786390e-02
## cause_code_factor -0.138019471 1.949691e-01
## resolution_code_factor 0.321690898 -2.408452e-01
## closure_category_factor 0.034597069 -7.656520e-02
## resolution_team_factor -0.189761572 1.554882e-01
## service_affected_factor 0.072958874 -1.206014e-01
## btu_type_factor 0.917431418 -7.633005e-01
## owner_factor -0.140281402 8.935859e-02
## group_owner_factor -0.235698575 1.794287e-01
## btu_platform_factor 1.000000000 -7.786325e-01
## dp_location_factor -0.778632539 1.000000e+00
## closed_by_factor -0.140281402 8.935859e-02
## fdp_device_name_factor 0.679829117 -6.488113e-01
## fdp_site_name_factor 0.679829117 -6.488113e-01
## olt_site_name_factor 0.691038611 -6.205678e-01
## exchange_factor 0.432506881 -3.629443e-01
## closed_by_factor fdp_device_name_factor
## created_by_factor 0.422567422 -0.019552198
## category_factor -0.001304302 0.030928924
## symptom_error_code_factor 0.044938790 -0.067927511
## product_factor -0.010655232 0.033258575
## sub_product_factor 0.013751528 -0.037210624
## package_name_factor 0.037250458 -0.050558778
## cause_category_factor -0.177254643 -0.110953694
## cause_code_factor -0.107275894 -0.378169436
## resolution_code_factor -0.213937489 0.221488714
## closure_category_factor -0.149334300 0.010767650
## resolution_team_factor 0.305977624 -0.132730527
## service_affected_factor -0.191518467 0.004493253
## btu_type_factor -0.114728843 0.697739570
## owner_factor 1.000000000 -0.102563516
## group_owner_factor 0.312657631 -0.123760434
## btu_platform_factor -0.140281402 0.679829117
## dp_location_factor 0.089358585 -0.648811265
## closed_by_factor 1.000000000 -0.102563516
## fdp_device_name_factor -0.102563516 1.000000000
## fdp_site_name_factor -0.102563516 1.000000000
## olt_site_name_factor -0.093581969 0.967847038
## exchange_factor -0.073246090 0.883823338
## fdp_site_name_factor olt_site_name_factor
## created_by_factor -0.019552198 -0.002774736
## category_factor 0.030928924 -0.002208588
## symptom_error_code_factor -0.067927511 -0.084814530
## product_factor 0.033258575 0.035622065
## sub_product_factor -0.037210624 -0.039078518
## package_name_factor -0.050558778 -0.069105930
## cause_category_factor -0.110953694 -0.082350746
## cause_code_factor -0.378169436 -0.407144023
## resolution_code_factor 0.221488714 0.255842048
## closure_category_factor 0.010767650 0.026245741
## resolution_team_factor -0.132730527 -0.133715385
## service_affected_factor 0.004493253 -0.001264552
## btu_type_factor 0.697739570 0.695045446
## owner_factor -0.102563516 -0.093581969
## group_owner_factor -0.123760434 -0.151449059
## btu_platform_factor 0.679829117 0.691038611
## dp_location_factor -0.648811265 -0.620567819
## closed_by_factor -0.102563516 -0.093581969
## fdp_device_name_factor 1.000000000 0.967847038
## fdp_site_name_factor 1.000000000 0.967847038
## olt_site_name_factor 0.967847038 1.000000000
## exchange_factor 0.883823338 0.865384094
## exchange_factor
## created_by_factor 0.043862550
## category_factor 0.050170300
## symptom_error_code_factor -0.067522699
## product_factor 0.049916894
## sub_product_factor -0.057424565
## package_name_factor 0.001436679
## cause_category_factor -0.107455667
## cause_code_factor -0.411389901
## resolution_code_factor 0.143212419
## closure_category_factor -0.019381066
## resolution_team_factor -0.076171825
## service_affected_factor 0.045046915
## btu_type_factor 0.404939806
## owner_factor -0.073246090
## group_owner_factor -0.083543009
## btu_platform_factor 0.432506881
## dp_location_factor -0.362944340
## closed_by_factor -0.073246090
## fdp_device_name_factor 0.883823338
## fdp_site_name_factor 0.883823338
## olt_site_name_factor 0.865384094
## exchange_factor 1.000000000
Find the highest correlated variables.
| Rules | Description |
|---|---|
| - +.70 or higher | Very strong relationship |
| - +.40 to +.69 | Strong positive relationship |
| - +.30 to +.39 | Moderate relationship |
| - +.20 to +.29 | weak relationship |
| - +.01 to +.19 | No or negligible relationship |
# Choose 0.7 Very strong relationship
highlyCorrelated <- findCorrelation(correlations, 0.7 ,verbose = FALSE,names = TRUE)
highlyCorrelated
## [1] "fdp_device_name_factor" "fdp_site_name_factor"
## [3] "olt_site_name_factor" "btu_platform_factor"
## [5] "btu_type_factor" "owner_factor"
## [7] "product_factor"
Summary of the correlated variables.
summary(correlations)
## created_by_factor category_factor symptom_error_code_factor
## Min. :-0.10352 Min. :-0.332674 Min. :-0.27778
## 1st Qu.:-0.01978 1st Qu.:-0.041628 1st Qu.:-0.06604
## Median : 0.02518 Median : 0.008555 Median : 0.00000
## Mean : 0.08851 Mean : 0.048063 Mean : 0.04314
## 3rd Qu.: 0.05064 3rd Qu.: 0.080404 3rd Qu.: 0.09081
## Max. : 1.00000 Max. : 1.000000 Max. : 1.00000
## product_factor sub_product_factor package_name_factor
## Min. :-0.32613 Min. :-0.20901 Min. :-0.21486
## 1st Qu.:-0.01066 1st Qu.:-0.04507 1st Qu.:-0.04632
## Median : 0.03444 Median : 0.02512 Median : 0.03207
## Mean : 0.13376 Mean : 0.11398 Mean : 0.11371
## 3rd Qu.: 0.20538 3rd Qu.: 0.13742 3rd Qu.: 0.08317
## Max. : 1.00000 Max. : 1.00000 Max. : 1.00000
## cause_category_factor cause_code_factor resolution_code_factor
## Min. :-0.32718 Min. :-0.411390 Min. :-0.46570
## 1st Qu.:-0.10118 1st Qu.:-0.116349 1st Qu.:-0.05153
## Median : 0.02265 Median : 0.008102 Median : 0.13114
## Mean : 0.05618 Mean :-0.002953 Mean : 0.10329
## 3rd Qu.: 0.11845 3rd Qu.: 0.126045 3rd Qu.: 0.24725
## Max. : 1.00000 Max. : 1.000000 Max. : 1.00000
## closure_category_factor resolution_team_factor service_affected_factor
## Min. :-0.16240 Min. :-0.46570 Min. :-0.332674
## 1st Qu.:-0.09314 1st Qu.:-0.20420 1st Qu.:-0.117622
## Median : 0.01077 Median :-0.13273 Median : 0.004493
## Mean : 0.03880 Mean :-0.02077 Mean : 0.021590
## 3rd Qu.: 0.07122 3rd Qu.: 0.10861 3rd Qu.: 0.065981
## Max. : 1.00000 Max. : 1.00000 Max. : 1.000000
## btu_type_factor owner_factor group_owner_factor
## Min. :-0.76330 Min. :-0.21394 Min. :-0.25002
## 1st Qu.:-0.10317 1st Qu.:-0.11287 1st Qu.:-0.14453
## Median : 0.02109 Median :-0.04195 Median :-0.06533
## Mean : 0.15933 Mean : 0.07947 Mean : 0.01856
## 3rd Qu.: 0.38577 3rd Qu.: 0.07825 3rd Qu.: 0.15281
## Max. : 1.00000 Max. : 1.00000 Max. : 1.00000
## btu_platform_factor dp_location_factor closed_by_factor
## Min. :-0.77863 Min. :-0.77863 Min. :-0.21394
## 1st Qu.:-0.13383 1st Qu.:-0.33242 1st Qu.:-0.11287
## Median : 0.05378 Median : 0.01889 Median :-0.04195
## Mean : 0.15600 Mean :-0.10292 Mean : 0.07947
## 3rd Qu.: 0.40480 3rd Qu.: 0.09020 3rd Qu.: 0.07825
## Max. : 1.00000 Max. : 1.00000 Max. : 1.00000
## fdp_device_name_factor fdp_site_name_factor olt_site_name_factor
## Min. :-0.648811 Min. :-0.648811 Min. :-0.620568
## 1st Qu.:-0.102564 1st Qu.:-0.102564 1st Qu.:-0.091390
## Median :-0.007529 Median :-0.007529 Median :-0.002492
## Mean : 0.170699 Mean : 0.170699 Mean : 0.169238
## 3rd Qu.: 0.565244 3rd Qu.: 0.565244 3rd Qu.: 0.582240
## Max. : 1.000000 Max. : 1.000000 Max. : 1.000000
## exchange_factor
## Min. :-0.41139
## 1st Qu.:-0.07325
## Median : 0.02265
## Mean : 0.15781
## 3rd Qu.: 0.33951
## Max. : 1.00000
Plot correlated variables.
png(height=1200, width=1200, pointsize=15, file="corrplot.png")
corrplot(correlations, method = "number",tl.cex = 0.9 ,addCoef.col="grey", order = "AOE")
dev.off()
## png
## 2
Feature selection process to confirm which variable does become the independent and resolution code is the dependent variable via GBM (Stochastic Gradient Boosting).
List of other available model - http://topepo.github.io/caret/modelList.html
set.seed(777)
suppressWarnings(suppressMessages(library(mlbench)))
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
model <-
train(
resolution_code_factor ~ ., data = df, method = "gbm", preProcess = "scale", trControl =
control , verbose = FALSE
)
## Loading required package: gbm
## Warning: package 'gbm' was built under R version 3.2.2
## Loading required package: survival
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
##
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
importance <- varImp(model, scale = TRUE)
print(importance)
## gbm variable importance
##
## only 20 most important variables shown (out of 21)
##
## Overall
## cause_code_factor 100.0000
## cause_category_factor 49.9421
## resolution_team_factor 31.3111
## created_by_factor 19.3383
## service_affected_factor 13.2905
## owner_factor 13.2559
## fdp_device_name_factor 11.1100
## btu_type_factor 6.3354
## product_factor 4.9427
## dp_location_factor 4.5966
## closure_category_factor 1.7504
## btu_platform_factor 1.5958
## package_name_factor 1.1886
## group_owner_factor 0.9891
## olt_site_name_factor 0.7943
## sub_product_factor 0.2145
## fdp_site_name_factor 0.0000
## exchange_factor 0.0000
## category_factor 0.0000
## closed_by_factor 0.0000
plot(importance)
So far , the main variables or factors found are :-
as based on the list the importance plot after the tuning and fitting predictive model process.
After the status = Closed
sss
sss
sss
sss
sss
sss
sss
sss
sss
sss
sss
sss
workflow <- print(grViz("
digraph {
# graph attributes
graph [overlap = true]
node [shape = box,
fontname = Calibri,
color = blue]
# edge statements
GET_Zone_Distinct_List -> RETURN_Zone_List -> LOOP_Zone_List -> GENERATE_SQL_filter_by_zone; DESIGN_Base_SQL ->
APPLY_record_filter -> RECEIVE_Zone_Paramater -> GENERATE_SQL_filter_by_zone ->
ORDER_records_in_random_mode -> Limit_records_per_zone -> Download_dataset -> Load_in_R->COVERT_dataset_as_dataframe->REMOVE_unrelated_fields;
Download_dataset-> SAVE_Trouble_Ticket_No_As_Reference;
Download_dataset-> SAVE_Description_Text_As_Reference;
}
"))
workflow
Previously have been mentioned here by applying SQL and condition rules here
Still in progress.
Still in progress.
Still in progress.